Abstract



Pocket computers are beginning to emerge that provide 
sufficient processing capability and memory capacity to 
run traditional desktop applications and operating sys- 
tems on them. The increasing demand placed on these 
systems by software is competing against the continu- 
ing trend in the design of low-power microprocessors to- 
wards increasing the amount of computation per unit of 
energy. Consequently, in spite of advances in low-power 
circuit design, the microprocessor is likely to continue 
to account for a significant portion of the overall power 
consumption of pocket computers. 

This paper investigates clock scaling algorithms on the 
Itsy, an experimental pocket computer that runs a com-
plete, functional multitasking operating system (a ver-
sion of Linux 2.0.30). We implemented a number of
clock scaling algorithms that are used to adjust the pro- 
cessor speed to reduce the power used by the proces- 
sor. After testing these algorithms, we conclude that cur- 
rently proposed algorithms consistently fail to achieve 
their goal of saving power while not causing user appli- 
cations to change their interactive behavior. 



1 Introduction 



Dynamic clock frequency scaling and voltage scaling are 
two mechanisms that can reduce the power consumed by 
a computer. Both voltage scaling and frequency scaling 
are important; the power consumed by a component im- 
plemented in CMOS varies linearly with frequency and 
quadratically with voltage. 

To evaluate the relative importance and the situations in
which either is useful, it is necessary to consider energy,
the integral of power over time. By reducing the fre- 
quency at which a component operates, a specific oper- 
ation will consume less power but may take longer to 
complete. Although reducing the frequency alone will 
reduce the average power used by a processor over that 
period of time, it may not deliver a reduction in energy 
consumption overall, because the power savings are
offset linearly by the increased execution time. While
greater energy reductions can be obtained with slower clocks
and lower voltages, operations take longer; this exposes
a fundamental tradeoff between energy and delay.

Many systems allow the processor clock to be varied. 
More recently, there are a number of processors that 
allow the processor voltage to be changed. For example,
the StrongARM SA-2 processor, currently being
designed by Intel, is estimated to dissipate 500 mW at
600 MHz, but only 40 mW when running at 150 MHz,
a 12-fold power reduction for a 4-fold performance
reduction [1]. Likewise, the Pentium-III processor with
SpeedStep technology dissipates 9W at 500MHz but
22W at 650MHz [2]. AMD has added clock and voltage
scaling to the AMD Mobile K6 Plus processor family 
and Transmeta has also developed processors with volt- 
age scaling. Because of this tradeoff in speed vs. power, 
the decision of when to change the frequency or the volt- 
age and frequency of such processors must be made ju- 
diciously while taking into account application demand 
and quality of user experience. 

We believe that the decision to change processor speed 
and voltage must be controlled by the operating system. 
The operating system or similar system software is the 
only entity with a global view of resource usage and de- 
mand. Although it is clear that the operating system 
should control the scheduling mechanism, it is not clear
what inputs are necessary to formulate the scheduling
policy. There are two possible sources of information for
policies. The application can estimate activity, providing 
information to the operating system about computation 
rates or deadlines, or the operating system can attempt to 
infer some policy for the applications from their behav- 
ior. These can be used separately or in concert to control 
voltage and processor speed. 

A number of studies have investigated policies to auto- 
matically infer computation demands and adjust the pro- 
cessor accordingly. We have implemented those previ- 
ously described algorithms; this paper describes our ex- 
perience. 

In the next section, we present some background mate- 
rial. We discuss related work in Section 3. In Section 4 
we describe the schedulers we examine, our workload 
and our measurement methodology. We then discuss our 
results in Section 5. 



2 Background 

To better understand the importance of voltage and clock 
scheduling, we begin by reviewing energy-consumption 
concepts, then present an overview of scheduling algo- 
rithms. Lastly, we give an overview of our test platform, 
the Itsy Pocket Computer. 

2.1 Energy 

The energy E, measured in Joules (J), consumed by
a computer over T seconds is equal to the integral of
the instantaneous power, measured in Watts (W). The
instantaneous power consumed by components implemented
in CMOS, such as microprocessors and DRAM,
is proportional to $V^2 \times F$, where V is the voltage
supplying the component, and F is the frequency of the clock
driving the component. Thus, the power consumed by 
a computer to, say, search an electronic phone book, 
may be reduced by reducing V, F, or both. However, 
for tasks that require a fixed amount of work, reducing 
the frequency may result in the system taking more time 
to complete the work. Thus, little or no energy will be 
saved. There are techniques that can result in energy sav- 
ings when the processor is idle, typically through clock 
gating, which avoids powering unused devices. 

In normal usage pocket computers run on batteries, 
which contain a limited supply of energy. However, as
discussed in [3], in practice, the amount of energy a bat-
tery can deliver (i.e., its capacity) is reduced with in- 
creased power consumption. As an illustration of this 
effect, consider the Itsy pocket computer that was used 
in this study (described in Section 2.3). When the sys-
tem is idle, the integrated power manager disables the
processor core but the devices remain active. If the sys- 
tem clock is 206 MHz, a typical pair of alkaline batteries 
will power the system for about 2 hours; if the system 
clock is set to 59 MHz, those same batteries will last for 
about 18 hours. Although the battery lifetime increased 
by a factor of 9, the processor speed was only decreased 
by a factor of 3.5. The capacity of the battery can also 
be increased by interspacing periods of high power de- 
mand with much longer periods of low power demand 
resulting in a "pulsed power" system [4]. The extent to
which these two non-ideal properties can be exploited 
is highly dependent on the chemical properties and the 
construction of a battery as well as the conditions un- 
der which the battery is used. In general, the former ef- 
fect (minimizing peak demand) is more important than 
the latter for the domain of pocket computers because 
pulsed power systems need a significant period of time 
to recharge the battery, and most computer applications 
place a more constant demand on the battery. 

If a system allows the voltage to be reduced when clock
speed is reduced (i.e. it supports voltage scaling), it
is better to reduce the clock speed to the minimum
needed rather than running at peak speed and then being 
idle. For example, consider a computation that normally 
takes 600 million instructions to complete. That appli- 
cation would take one second on a StrongARM SA-2 at 
600MHz and would consume 500 mJoules. At 150MHz, 
the application would take four seconds to complete, 
but would only consume 160 mJoules, roughly a three-fold
savings assuming that an idle computer consumes no
energy. There is obviously a significant benefit to running
slower when the application can tolerate additional de- 
lay. Pering [5] used the term voltage scheduling to mean 
scheduling policies that seek to adjust both clock speed 
and energy. The goal of voltage scheduling is to reduce 
the clock speed such that all work on the processor can 
be completed "on time" and then reduce the voltage to 
the minimum needed to insure stability at that frequency. 
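
The arithmetic behind this example can be checked with a short script. The sketch below is illustrative only: the function name `task_energy_mj` is hypothetical, and it assumes one instruction per clock cycle and zero idle power, as the example in the text does. The power figures are the SA-2 estimates quoted in Section 1.

```python
def task_energy_mj(power_mw: float, clock_mhz: float, instructions_m: float):
    """Return (seconds, millijoules) needed to finish a fixed amount of work.

    Assumptions (not measurements): one instruction retires per cycle and
    the processor consumes no energy once the work is done.
    """
    seconds = instructions_m / clock_mhz   # M instructions / (M cycles per second)
    energy_mj = power_mw * seconds         # mW * s = mJ
    return seconds, energy_mj


# 600 million instructions at the two operating points quoted in the text.
fast_time, fast_energy = task_energy_mj(power_mw=500, clock_mhz=600, instructions_m=600)
slow_time, slow_energy = task_energy_mj(power_mw=40, clock_mhz=150, instructions_m=600)

print(fast_time, fast_energy)      # 1.0 500.0
print(slow_time, slow_energy)      # 4.0 160.0
print(fast_energy / slow_energy)   # 3.125
```

The slower operating point draws 12.5 times less power, but because the task runs four times longer the energy ratio is 500/160, or about 3.1.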

2.2 Clock Scheduling Algorithms 

In scheduling the voltage at which a system operates and 
the frequency at which it runs, a scheduler faces two 
tasks: to predict what the future system load will be 
(given past behavior) and to scale the voltage and clock 
frequency accordingly. These two tasks are referred to as
prediction and speed-setting [6]. We consider one sched-
uler better than another if it meets the same deadlines (or
has the same behavior) as another policy but reduces the
clock speed for longer periods of time. 

The schedulers we implemented are interval schedulers, 
so called because the prediction and scaling tasks are 
performed at fixed intervals as the system runs [7]. At 
each interval, the processor utilization for the interval is 
predicted, using the utilization of the processor over one 
or more preceding intervals. We consider two prediction
algorithms originally proposed by Weiser et al. [7]:
PAST and AVG_N. Under PAST, the current interval is
predicted to be as busy as the immediately preceding
interval, while under AVG_N, an exponential moving average
with decay N of the previous intervals is used. That is,
at each interval, we compute a "weighted utilization" at
time t, $W_t$, as a function of the utilization of the previous
interval $U_{t-1}$ and the previous weighted utilization
$W_{t-1}$. The AVG_N policy sets
$W_t = \frac{N \times W_{t-1} + U_{t-1}}{N + 1}$. The
PAST policy is simply the AVG_0 policy, and assumes the
current interval will have the same resource demands as
the previous interval.
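
A minimal sketch of this predictor follows, assuming the AVG_N update is the weighted average W_t = (N * W_{t-1} + U_{t-1}) / (N + 1). The function names and the initial weighted utilization of zero are illustrative choices, not details taken from the implementation described in this paper.

```python
def avg_n(prev_weighted: float, prev_util: float, n: int) -> float:
    """One AVG_N update: blend the last interval's utilization into the history."""
    return (n * prev_weighted + prev_util) / (n + 1)


def predict(utilizations, n: int, initial: float = 0.0):
    """Return the weighted utilization after each observed interval.

    `initial` is an assumed starting value for the weighted utilization.
    """
    weighted = initial
    history = []
    for util in utilizations:
        weighted = avg_n(weighted, util, n)
        history.append(weighted)
    return history


# With N = 0 the predictor reduces to PAST: it simply echoes the last interval.
print(predict([1.0, 0.0, 0.5], n=0))   # [1.0, 0.0, 0.5]
# With N = 3 a single busy interval moves the estimate only a quarter of the way.
print(avg_n(0.5, 1.0, n=3))            # 0.625
```

Larger values of N smooth out short bursts but also delay the response to a genuine change in load.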

The decision of whether to scale the clock and/or voltage 
is determined by a pair of boundary values used to pro- 
vide hysteresis to the scheduling policy. If the utilization 
drops below the lower value, the clock is scaled down; 
similarly, if the utilization rises above the higher value, 
the clock is scaled up. Pering et al. [8] set these values at 
50% and 70%. We used those values as a starting point 
but, as we discuss in Section 5.3, we found that the spe- 
cific values are very sensitive to application behavior. 
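
The hysteresis rule can be sketched as a three-way decision. The names `Action` and `decide` are hypothetical; the default bounds of 50% and 70% are the starting values cited above, and utilization between the bounds is assumed to leave the clock unchanged.

```python
from enum import Enum


class Action(Enum):
    SCALE_DOWN = -1
    HOLD = 0
    SCALE_UP = 1


def decide(utilization: float, low: float = 0.50, high: float = 0.70) -> Action:
    """Map a (predicted) utilization to a scaling decision using two bounds."""
    if utilization < low:
        return Action.SCALE_DOWN
    if utilization > high:
        return Action.SCALE_UP
    return Action.HOLD   # inside the hysteresis band: leave the clock alone


print(decide(0.40))   # Action.SCALE_DOWN
print(decide(0.60))   # Action.HOLD
print(decide(0.85))   # Action.SCALE_UP
```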

Deciding how much to scale the processor clock is sep-
arate from the decision of when to scale the clock up
(or down). The SA-1100 processor used in the Itsy sup-
ports 11 different clock rates or "clock steps". Thus, our
algorithms must select one of the discrete clock steps. 
We use three algorithms for scaling: one, double, and 
peg. The one policy increments (or decrements) the 
clock value by one step. The peg policy sets the clock 
to the highest (or lowest) value. The double policy 
tries to double (or halve) the clock step. Since the low- 
est clock step on the Itsy is zero, we increment the clock 
index value before doubling it. Separate policies may be 
used for scaling upwards and downwards.
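
The three speed-setting policies can be sketched over clock-step indices 0 through 10. This is a hedged illustration: the function names are hypothetical, results are clamped to the valid range, and the exact arithmetic used for doubling and halving (here `(step + 1) * 2` going up and integer division by two going down) is an assumption, since the text specifies only that the index is incremented before doubling.

```python
NUM_STEPS = 11   # the SA-1100 clock steps, indexed 0 (slowest) to 10 (fastest)


def clamp(step: int) -> int:
    """Keep a step index inside the valid range."""
    return max(0, min(NUM_STEPS - 1, step))


def one(step: int, up: bool) -> int:
    """Move the clock by a single step."""
    return clamp(step + 1 if up else step - 1)


def peg(step: int, up: bool) -> int:
    """Jump straight to the highest or lowest step."""
    return NUM_STEPS - 1 if up else 0


def double(step: int, up: bool) -> int:
    """Double or halve the step index (assumed arithmetic, see lead-in)."""
    if up:
        return clamp((step + 1) * 2)   # increment first so index 0 can grow
    return clamp(step // 2)


print(one(4, up=True), one(0, up=False))          # 5 0
print(peg(4, up=True), peg(4, up=False))          # 10 0
print(double(0, up=True), double(10, up=False))   # 2 5
```

Because separate policies may be used in each direction, a combination such as peg upwards with one downwards is simply a matter of calling different functions for the two cases.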




Figure 1: Equipment setups used to measure power.



2.3 The Itsy Pocket Computer

The Itsy Pocket Computer is a flexible research plat- 
form, developed to enable hardware and software re- 
search in pocket computing. It is a small, low-power, 
high-performance handheld device with a highly flexible 
interface, designed to encourage the development of in- 
novative research projects, such as novel user interfaces, 
new applications, power management techniques, and 
hardware extensions. There are several versions of the
basic Itsy design, with varying amounts of RAM, flash
memory and I/O devices. We used several units for this
study that were modified by Compaq Computer Corpo-
ration's Western Research Lab to include instrumenta-
tion leads for power measurement. Figure 1 shows the
units along with the measurement equipment we used. 
We investigate the energy and power consumption of 
the Itsy Pocket Computer when it is run at between 
59 MHz and 206 MHz, and when its StrongARM SA-1100
[9, 10] processor is powered at two different voltage
levels.

All versions of the Itsy are based on the low-power 
StrongARM SA-1100 microprocessor. All versions
have a small, high-resolution display, which offers 320 x 
200 pixels on a 0.18mm pixel pitch, and 15 levels of 
greyscale. All versions also include a touchscreen, a mi- 
crophone, a speaker, and serial and IrDA communica- 
tion ports. The Itsy architecture can support up to 128 
Mbytes both of DRAM and flash memory. The flash 
memory provides persistent storage for the operating 
system, the root file system, and other file systems and 
data. Finally, the Itsy also provides a "daughter card"
interface that allows the base hardware to be easily ex- 
tended. The Itsy uses two voltage supplies powered by 
the same power source. The processor core is driven by a 
1.5 V supply while the peripherals are driven by a 3.3 V
supply. Both power supplies are driven by a single 3.1V
supply connected to the electrical mains.

Figure 2: Itsy System Architecture

The Itsy version 1.5 units used as the basis for this work
have 64 Mbytes of DRAM and 32 Mbytes of flash memory.
These units were modified to allow us to run the
StrongARM SA-1100 at either 1.5 V or 1.23 V. Although
1.23 V is below the manufacturer's specification,
it can be safely used at moderate clock speeds, and
our measurements indicate the voltage reduction yields
about a 15% reduction in the power consumed by the
processor; the percentage of power reduction for the system
may be less than this (depending on workload) because
voltage scaling only reduces the power used by the
processor. The Itsy can be powered either by an external
supply or by two size AAA batteries. Figure 2 shows a
schematic of the Itsy architecture.

The system software of the Itsy includes a monitor and 
a port of version 2.0.30 of the Linux operating sys- 
tem. The Linux system was configured to provide sup- 
port for networking, file systems and multi-user manage- 
ment. Applications can be developed using a number of 
programming environments, including C, X-Windows, 
SmallTalk and Java. Applications can also take advan- 
tage of available speech synthesis and speech recogni- 
tion libraries. 



3 Related Work 



We believe our evaluation of dynamic speed and
voltage setting algorithms to be the first such empirical
evaluation; to our knowledge, all previous work from
different groups has relied on simulators [7, 6, 5, 11, 12],
and none modeled a complete pocket computer or the
workload likely to be run on it.

Weiser et al. [7] proposed three algorithms, OPT,
FUTURE, and PAST, and evaluated them using traces
gathered from UNIX-based workstations running engineering
applications. These algorithms use an interval-based
approach that determines the clock frequency for
each interval. Of the algorithms they propose, only
PAST is feasible because it does not make decisions using
future information that would not be available to an
actual implementation. Even so, the actual version of
PAST proposed by Weiser et al. is not implementable
because it requires that the scheduler know the amount
of work that had to be performed in the preceding intervals.
This information was used by the scheduler to
choose a clock speed that allows this delayed work to be
completed in the next interval, if possible. For example,
suppose post-processing of a trace revealed that the processor
was busy 80% of the cycles while running at full
speed. If, during re-play of the trace, the scheduler opted
to run the processor at 50% speed for the interval, then
30% of the work could not be completed in that interval.
Consequently, in the next interval, the scheduler would
adjust the speed in an effort to at least complete the 30%
"unfinished" work. Without additional information from
the application, the scheduler can only observe that
the application executed until the end of the scheduling
quantum, and does not know the amount of "unfinished"
computing left. Because most pocket computer applications
do not provide a means for the processor to know
how much work should be done in a given interval, the
PAST algorithm is not tractable for such systems.
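
The bookkeeping that a trace-driven replay relies on can be sketched as follows. The function `replay_interval` is hypothetical; work and speed are both expressed as fractions of the cycles available in one interval at full speed. The sketch shows why the carried-over amount is visible to a simulator that knows the full-speed demand, but not to a running system that only sees a fully busy interval.

```python
def replay_interval(demand_at_full_speed: float, speed: float,
                    carried_over: float = 0.0):
    """Replay one trace interval at a reduced speed.

    Returns (work_done, unfinished), both as fractions of a full-speed interval.
    A simulator knows `demand_at_full_speed` from the trace; a live scheduler
    would only observe that the interval was 100% busy.
    """
    demand = demand_at_full_speed + carried_over
    work_done = min(demand, speed)
    return work_done, demand - work_done


# The example from the text: 80% busy at full speed, replayed at 50% speed.
done, unfinished = replay_interval(0.80, speed=0.50)
print(round(done, 2), round(unfinished, 2))   # 0.5 0.3

# The next interval must absorb the carried-over work on top of its own demand.
done, unfinished = replay_interval(0.20, speed=0.50, carried_over=unfinished)
print(round(done, 2), round(unfinished, 2))   # 0.5 0.0
```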

The early work of Weiser et al. has been extended by 
several groups, including [6, 12]. Both of these groups 
employed the same assumptions and the same traces 
used by Weiser. Govil et al. [6] considered a large num-
ber of algorithms, while Martin [12] revised Weiser's
PAST algorithm to account for the non-ideal properties 
of batteries and the non-linear relationship between sys- 
tem power and clock frequency. Martin argues that the 
lower bound on clock frequency should be chosen such
that the number of computations per battery lifetime is 
maximized. While Martin correctly assumed a non-zero 
energy cost for idling the processor and changing clock 
speed, neither Govil nor Weiser did. 

Both our work and that of Pering et al. [5, 11] ad-
dress some of the limitations of the above-noted ear-
lier work. In particular, we both evaluate implementable
algorithms using workloads that are representative of 
those that might be run on pocket computers. We as- 
sess the success of our algorithms under the assumption 
that our applications have inelastic performance con-
straints and that the user should see no visible changes
induced by the scheduling algorithms. By comparison,
Pering et al. assume that frames of an MPEG video,
for instance, can be dropped and present results that
combine energy savings and frame
rates. Our goal was to understand the performance of
the different scheduling algorithms without introducing
the complexity of comparing multi-dimensional perfor- 
mance metrics such as the percentage of dropped frames 
vs. power savings. 

Pering et al. use intervals of 10-50ms for their schedul-
ing calculations. In comparison to the earlier approaches
presented in [7, 6, 12] in which work was considered 
overdue if it was not completed within an interval, both 
Pering et al. and our study consider an event to have 
occurred on time if delaying its completion did not ad- 
versely affect the user. However, a number of impor- 
tant differences exist between our work and Pering et 
al. First, Pering et al. model only the power consumed 
by the microprocessor and the memory, thus ignoring 
other system components whose power is not reduced
by changes in clock frequency. Second, by virtue of 
our work using an actual implementation, we are able 
to evaluate longer running applications and more com-
plex applications (e.g., Java). By virtue of their size,
our applications exhibit more significant memory behav-
ior, and thus expose the non-linear relationship between
power and clock speed noted by Martin. Lastly, by us-
ing an actual system, our scheduling implementations
were exposed to periodic behaviors that are captured 
by traces; for example, the Java implementation uses a 
30ms polling loop to check for I/O events. This periodic 
polling adds additional variation to the clock setting al- 
gorithms, inducing the sort of instability we will explain 
in §5.3. 



4 Methodology 

Before describing the implementation of the clock and 
voltage scheduling algorithms we used, it is important to 
understand how we did our measurements. Section 4.1 
describes how we measure power and energy. We then 
describe the implementation of the schedulers and the 
workloads we used to assess their performance. 



4.1 Measuring Power and Total Energy 

To measure the instantaneous power consumed by the 
Itsy, we use a data acquisition (DAQ) system to record 
the current drawn by the Itsy as it is connected to an ex- 
ternal voltage supply, and the voltage provided by this 
supply. Figure 1 presents a picture of our setup along 
with the wires connected to the Itsy to facilitate mea-
suring the supply current¹ and voltage. We configured
the DAQ system to read the voltage 5000 times per sec-
ond, and convert these readings to 16-bit values. These
values were then forwarded to a host computer, which 
stored them for subsequent analysis. From these mea- 
surements, we can compute a time profile of the power 
used by an application as it runs on the Itsy. 

To determine the relevant part of the power-usage pro- 
file of a workload, we measure the time required to ex- 
ecute the workload and then select the relevant set of 
measurements from the data collected by the DAQ sys-
tem. For each benchmark, we used the gettimeofday
system call to time its execution; this interface uses
the 3.6 MHz clock available on the processor to provide
accurate timing information. To synchronize the collec-
tion of the voltages with the start of execution of a work-
load, as the workload begins executing, we toggle one
of the SA-1100's general-purpose input-output (GPIO)
pins. This pin is connected to the external trigger of the
DAQ system; toggling the GPIO causes the DAQ system 
to begin recording measurements. As our measurement 
technique is very similar to that which we used in [13], 
we refer the reader to this reference for a more in-depth 
description. 

Once the relevant part of the profile has been determined,
we use it to calculate the average power and
the total energy consumed by the Itsy during the corresponding
time interval. To compute the energy, we
make the assumption that the power measured at time
t represents the average power of the Itsy for the interval
t to t + 0.0002 seconds, where 0.0002 seconds is
the time between each successive power measurement.
Thus, the energy E is equal to
$\sum_{i=1}^{n} p_i(t) \times 0.0002$,
where $p_1(t), \ldots, p_n(t)$ are the n power readings of
interest.
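
The energy calculation amounts to a rectangle-rule sum over the sampled power. The sketch below is illustrative and is not the authors' analysis script: the function names are hypothetical, the 0.02 ohm sense-resistor value is an assumption based on the footnote, and the sample values are made up for the example.

```python
SAMPLE_PERIOD_S = 0.0002   # 5000 samples per second
SHUNT_OHMS = 0.02          # assumed value of the current-sense resistor


def power_samples(supply_volts, shunt_drop_volts):
    """Convert paired voltage readings into instantaneous power in watts."""
    return [v * (drop / SHUNT_OHMS)            # P = V * I, with I = V_drop / R
            for v, drop in zip(supply_volts, shunt_drop_volts)]


def total_energy_joules(powers):
    """Each reading is taken as the average power for one sample period."""
    return sum(powers) * SAMPLE_PERIOD_S


def average_power_watts(powers):
    return sum(powers) / len(powers)


# One second of made-up data: a constant 3.1 V supply and a 2 mV shunt drop.
volts = [3.1] * 5000
drops = [0.002] * 5000
p = power_samples(volts, drops)
print(round(average_power_watts(p), 3))   # 0.31
print(round(total_energy_joules(p), 3))   # 0.31
```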

In making our power measurements, we used an approach
similar to the one used in [13] to reduce a number
of sources of possible measurement error. We measured
multiple runs of each workload; in general, we
found the 95% confidence interval of the energy to be
less than 0.7% of the mean energy. This implies that the
runs were very repeatable, despite the possible variation
that would arise from interactions between application
threads, other processes and system daemons.

¹The supply current was measured by measuring the voltage drop
across a high precision small-valued resistor of a known resistance
(0.02 Ω). The current was then calculated by dividing the voltage by
the resistance.

4.2 Workload 



We used a varied workload to assess the performance
of the different clock scaling algorithms. Since it is not
clear what applications will be common on pocket computers,
we used some obvious applications (web browsing,
text reading) and other less obvious applications
(chess, MPEG video and audio). The applications ran
either directly on top of the Linux operating system or
within a Java virtual machine [14]. To capture repeatable
behavior for the interactive applications, we used
a tracing mechanism that recorded timestamped input
events and then allowed us to replay those events with
millisecond accuracy. We did not trace the MPEG playback
because there is no user interaction, and we found
little inter-run variance. We used the following applications:

MPEG: We played a 320x200 color MPEG-1 video
and audio clip at 15 frames a second. The MPEG
video was rendered as a greyscale image on the
Itsy. Audio was rendered by sending the audio
stream as a WAV file to an audio player
which ran as a separate process, forked from the
video player. There is no explicit synchronization
between the audio and video sequences, but
both are sequenced to remain synchronized at 15
frames/second. The clip is 14 seconds and was
played in a loop to provide 60 seconds of playback.

Web: We used a Javabean version of the Ice Web
browser to view content stored on the Itsy.
We selected a file containing a stored article
from www.news.com concerning the Itsy. We
scrolled down the page, reading the full article.
We then went back to the root menu and opened a
file containing an HTML version of WRL technical
report TN-56, which has many tables describing
characteristics of power usage in Itsy components.
The overall trace was 190 seconds of activity.

Chess: We used a Java interface to version 16.10 of the
Crafty chess playing program. Crafty was run as
a separate process. Crafty uses a play book for
opening moves and then plays for specific periods
of time in later stages of the games and plays the
best move available when time expires. The 218
second trace includes a complete game of Crafty
playing against a novice player (who lost, badly).

TalkingEditor: We used a version of the "mpedit" Java 
text editor that had been modified to read text files 
aloud using the DECtalk speech synthesis system 
(which is run in a separate process). The input 
trace records the user selecting a file to be opened 
using the file dialogue (i.e. moving to the direc-
tory of the short text file and selecting the file),
then having it spoken aloud and finally opening
and having another text file read aloud. The trace 
took 70 seconds. 

The Kaffe Java system [14] uses a JIT, makes extensive 
use of dynamic shared libraries and supports a threading 
model using setjmp/longjmp. The graphics library used 
by Java is a modified version of the publicly available
GRX graphics library and uses a polling I/O model to 
check for new input every 30 milliseconds. The MPEG 
player renders directly to the display. 

4.3 Implementing the Scheduling Algorithms 

We made two modifications to the Linux kernel to sup- 
port our clock scheduling algorithms and data record- 
ing. The first modification provides a log of the process 
scheduler activity. This component is implemented as 
a kernel module with small code modifications to the 
scheduler that allow the logging to be turned on and 
off. For each scheduling decision, we record the pro- 
cess identifier of the process being scheduled, the time 
at which it was scheduled (with microsecond resolution) 
and the current clock rate. 

We also implemented an extensible clock scaling policy 
module as a kernel module. We modified the clock in- 
terrupt handler to call the clock scheduling mechanism 
if it has been installed, and the Linux scheduler to keep 
track of CPU utilization. In Linux, the idle process al- 
ways uses the zero process identifier. The idle process 
enters a low-power "nap" mode that stalls the processor 
pipeline until the next scheduling interval. If the previ-
ous process was not the idle process, the kernel adds the
execution time to a running total. On every clock inter- 
rupt, this total is examined by the clock scaling module 
and then cleared. The CPU utilization can be calculated 
by comparing the time spent non-idle to the time length
of a quantum. Our time quantum was set to 10 msec, the
default scheduling period in Linux; Pering et al. [5, 11]
used similar values for their calculations. 
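
The utilization accounting described here can be modeled in a few lines. This is a user-space sketch of the idea, not the kernel code: the class and method names are hypothetical, times are in microseconds, and the scheduler is assumed to report how long the outgoing process ran at each context switch.

```python
QUANTUM_US = 10_000   # 10 ms scheduling quantum
IDLE_PID = 0          # in Linux the idle process has process identifier zero


class UtilizationCounter:
    """Accumulates non-idle run time and reports utilization per quantum."""

    def __init__(self):
        self.busy_us = 0

    def on_context_switch(self, prev_pid: int, ran_us: int) -> None:
        # Only time spent in non-idle processes counts as busy time.
        if prev_pid != IDLE_PID:
            self.busy_us += ran_us

    def on_clock_interrupt(self) -> float:
        # Examine the running total, then clear it for the next quantum.
        utilization = min(1.0, self.busy_us / QUANTUM_US)
        self.busy_us = 0
        return utilization


counter = UtilizationCounter()
counter.on_context_switch(prev_pid=42, ran_us=2_500)       # a user process ran
counter.on_context_switch(prev_pid=IDLE_PID, ran_us=7_500) # then the idle process
print(counter.on_clock_interrupt())   # 0.25
```

The value returned at each clock interrupt is what a clock scaling policy module would feed into its prediction step.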

Normally, a process can run for several quanta before the 
scheduler is called. The executing process is interrupted 
by the 100Hz system clock when the O/S decrements
and examines a counter in the process control block at 
each interrupt. When that counter is zero, the scheduler 
is called. We set the counter to one each time we sched- 
ule a process, forcing the scheduler to be called every 
10ms. While this modification adds overhead to the exe- 
cution of an application, it allows us to control the clock 
scaling more rapidly. We measured the execution over- 
head and found it to be very small (about 6 microseconds 
for each 10ms interval, or 0.06%). 



5 Results 



The purpose of our study is to determine if the heuris-
tics developed in prior studies can be practically applied
to actual pocket computers. We examined a number of
policies, most of which are variants of the AVG_N policy.
As described in §4.3, we used three different speed set-
ting policies. Our intent was to focus on systems that
could be implemented in an actual O/S and that did not
require modifications to the applications (such as requir- 
ing information about deadlines or schedules). We as- 
sumed that our workloads had inelastic constraints; in 
other words, we assumed the applications had no way to 
accommodate "missed deadlines". 

We split the discussion of our results into three parts. 
The first section describes aspects of the applications 
and how they differ from those used in prior work and 
the second section discusses the performance of the dif- 
ferent clock scheduling algorithms. Finally, we examine 
the benefit of the limited voltage scaling available on the 
Itsy and summarize the results. 



5.1 Application Characteristics



Figure 3 presents plots of the processor utilization over 
time for each of the benchmark applications. This infor- 
mation was gathered using the on-line process logging 
facility that we added to the kernel. Due to kernel mem- 
ory limitations, we could only capture a subset of the 
process behavior. Each application was able to run at
132MHz and still meet any user interaction constraints
(i.e. the application did not appear to behave any differ-
ently).

Figure 3: Utilization using 10ms Moving Average For
Between 30 to 40 Second Intervals Using 206MHz Fre-
quency Setting. Panels: (a) MPEG Program at 206MHz,
(c) Chess Program at 206MHz, (d) TalkingEditor Program
at 206MHz; horizontal axis: Time (microseconds).


The utilization is computed for each 10ms scheduling
quantum. We used the same 10ms interval for logging
that is used for scheduling within Linux. Since most processes
compute for several quanta before yielding, the
system is usually either completely idle or completely
busy during a given quantum. Some processes execute
for only a short time then yield the processor prior to
the end of their scheduling quantum; for example, the Java
implementation we used has a 30ms I/O polling loop.
Thus, when the Java system is "idle," there is a constant
polling action every 30ms that takes about a millisecond
to complete.

The behavior of the applications is difficult to predict,
even for applications that should have very predictable
behavior, and each application appears to run at a different
time-scale. The MPEG application renders at 15
frames/sec; there are 450 frames in the 30 second interval
shown in Figure 3. Each frame is rendered in
67ms or just under 7 scheduling quanta. Any scheduling
mechanism attempting to use information from a single
frame (as opposed to a single quantum) would need to examine
at least 7 quanta. Other applications have much
coarser behavior. For example, the TalkingEditor application
consumes varying amounts of CPU time until the
text is being loaded for speech synthesis. The bursty
behavior prior to the speech synthesis results from dragging
images, JIT'ing applications and opening files. Following
this are long bursts of computation as the text
is actually synthesized and sent to the OSS-compatible
sound driver. Finally, more cycles are taken by the sound
driver. Thus, this application is bursty at a higher level.

For most applications, patterns in the utilization are eas- 
ier to see if you plot the utilization using a 100ms mov- 
ing average, as shown in Figure 4. The MPEG appli- 
cation, in Figure 4(a), is still very sporadic because of 
inter-frame variation; for MPEG, there is even signif- 
icant variance in CPU utilization (60-80%) when con- 
sidering a 1 second moving average (not shown). The 
Chess and TalkingEditor applications show patterns in-
fluenced by user interaction. It's clear from Figure 4(c)
that utilization is low when the user is thinking or mak- 
ing a move and that utilization reaches 100% when 
Crafty is planning moves. Likewise, Figure 4(d) shows 
the aforementioned pattern of synthesis and sound ren- 
dering. 

Figure 4: Utilization Using 100ms Moving Average for
30 to 40 Second Intervals Using 206MHz Frequency
Setting



5.2 Clock Scheduling Comparison 



The goal of a clock scheduling algorithm is to try to 
predict or recognize a CPU usage pattern and then set 
the CPU clock speed sufficiently high to meet the (pre- 
dicted) needs of that application. Although patterns in 
the utilization are more evident when using a 100ms 
sliding average for utilization, we found that averaging 
over such a long period of time caused us to miss our 
"deadline". In other words, the MPEG audio and video 
became unsynchronized and some other applications
such as the speech synthesis engine had noticeable delays.
This occurs because it takes longer for the system
to realize it is becoming busy. 

This delay is the reason that the studies of Govil et al. [6] 
and Weiser [7] argued that clock adjustment should ex- 
amine a 10-50ms interval when predicting future speed 
settings. However, as Figure 3 shows, it is difficult to 
find any discernible pattern at the smaller time-scales.
Like Govil et al., we also allowed speed setting to occur 
at any interval; Weiser et al. did not model having the 
scheduler interrupted while an application was running, 
but rather deferred clock speed changes to occur only
when a process yielded or began executing in a quantum.

There are a number of possible speed-setting heuristics 
we could examine; since we were focusing on imple- 
mentable policies, we primarily used the policies ex- 
plored by Pering et al. [5]. We also explored other al- 
ternatives. One simple policy would determine the num- 
ber of "busy" instructions during the previous N 10ms 
scheduling quanta and predict that activity in the next 
quantum would have the same percentage of busy cycles.
The clock speed would then be set to ensure enough busy
cycles.

This policy sounds simple, but it results in exception- 
ally poor responsiveness, as illustrated in Figure 5. Fig- 
ure 5(a) shows the speed changes that would occur when 
the application is moving from a period of high CPU utilization
to one of low utilization; the speed changes to
59MHz relatively quickly because we are adding in a
large number of idle cycles each quantum. By comparison,
when the application moves from an idle period
to a fully utilized period, the simple speed setting pol- 
icy makes very slow changes to the processor utilization 
and thus the processor speed increases very slowly. This 
occurs because the total number of non-idle instructions 
across the four scheduling intervals grows very slowly. 

Figure 5: Simple averaging behavior results in poor policies.
Each box represents a single scheduling interval,
and the scheduling policy averages the number of non-idle
instructions over the four scheduling quanta to select
the minimum processor speed. To simplify the example,
we assume each interval is either fully utilized or idle.
The notation "206/0" means the CPU is set to 206MHz
and the quantum is idle while "206/1" means the CPU is
fully utilized.
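
The simple averaging policy can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation used in the study: the list of clock steps (SPEEDS_MHZ) is assumed rather than taken from this section, every quantum is either fully busy or fully idle, and the policy picks the slowest step that covers the average number of busy cycles seen over the last four quanta. The function names are hypothetical.

```python
# Assumed clock steps in MHz; only some of these values appear in the text.
SPEEDS_MHZ = [59.0, 73.7, 88.5, 103.2, 118.0, 132.7,
              147.5, 162.2, 176.9, 191.7, 206.4]


def pick_speed(history):
    """Pick the slowest clock that covers the average busy work.

    history holds (speed_mhz, busy_fraction) for the last N quanta.
    speed * busy_fraction is proportional to the busy cycles in a quantum.
    """
    needed = sum(speed * busy for speed, busy in history) / len(history)
    for speed in SPEEDS_MHZ:
        if speed >= needed:
            return speed
    return SPEEDS_MHZ[-1]


def simulate(start_speed, start_busy, new_busy, steps, n=4):
    """Run `steps` quanta after the workload switches to `new_busy`."""
    history = [(start_speed, start_busy)] * n
    speed = start_speed
    trace = []
    for _ in range(steps):
        # The quantum runs at the current speed, then the policy decides.
        history = history[1:] + [(speed, new_busy)]
        speed = pick_speed(history)
        trace.append(speed)
    return trace


print("busy -> idle:", simulate(206.4, 1.0, 0.0, steps=4))
print("idle -> busy:", simulate(59.0, 0.0, 1.0, steps=4))
```

Under these assumptions the downward trace reaches 59MHz by the third idle quantum, while the upward trace is still at 59MHz after four fully busy quanta, because a saturated quantum at a low clock contributes few busy cycles to the average. The exact trajectory drawn in Figure 5 may differ.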

5.3 The AVG_N Scheduler



We had initially thought that a policy targeting the necessary
number of non-idle cycles would result in good behavior,
but the previous example highlights why we use
the speed-setting policies described in §4.3. We used the
same AVG_N scheduler proposed by Govil [6] and Pering [5]
and also examined by Pering et al. in [5]; Pering's
later paper [11] did not examine scheduler heuristics
and only used real-time scheduling with application-specified
scheduling goals.

Our findings indicate that the AVG_N algorithm cannot
settle on the clock speed that maximizes CPU utilization.
Although a given set of parameters can result in optimal
performance for a single application, these tuned parameters
will probably not work for other applications, or
even the same application with different input. The variance
inherent in many deadline-based applications prevents
an accurate assessment of the computational needs
of an application. The AVG_N policy can easily be designed
to ensure that very few deadlines will be missed,
but this results in minimal energy savings. We use an
MPEG player as a running example in this section, as
it best exemplifies behavior that illustrates the multitude
of problems in past-based interval algorithms. Our intuition
is that if there is a single application that illustrates
simple, easy-to-predict behavior, it should be
MPEG. Our measurements showed that the MPEG application
can run at 132MHz without dropping frames
and still maintain synchronization between the audio and
video. An ideal clock scheduling policy would therefore
target a speed of 132MHz.

However, without information from the user level appli- 
cation, a kernel cannot accurately determine what dead- 
lines an application operates under. First, an application 
may have different deadline requirements depending on 
its input; for example, an MPEG player displaying a 
movie at 30fps has a shorter deadline than one running 
at 15fps. Although the deadlines for an application with 
a given input may be regular, the computation required 
in each deadline interval can vary widely. Again, MPEG 
players demonstrate this behavior; I-frames (key or reference)
require much more computation than P-frames
(predicted), and do not necessarily occur at predictable 
intervals. 

One method of dealing with this variance is to look at
lengthy intervals which will, by averaging, reduce the
variance of the computational observations. Our utilization
plots showed that significant variance is exhibited
even using 100ms intervals. In addition to interval
length, the number of intervals over which we average
(the N of the AVG_N policy) can also be manipulated. We
conducted a comprehensive study and varied the value
of N from 0 (the PAST policy) to 10 with each combination
of the speed-setting policies (i.e. using "peg"
to set the CPU speed to the highest point, or "one" to
increment or decrement the speed).
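
The two speed-setting steps named above can be sketched as follows. This is an illustrative sketch, not the kernel code: the clock table SPEEDS_MHZ and the function name are assumptions, and "peg" is taken to jump to the lowest setting when scaling down, mirroring the jump to the highest setting when scaling up.

```python
# Assumed clock steps in MHz; only some of these values appear in the text.
SPEEDS_MHZ = [59.0, 73.7, 88.5, 103.2, 118.0, 132.7,
              147.5, 162.2, 176.9, 191.7, 206.4]


def next_speed_index(index, direction, mode):
    """Return the new index into SPEEDS_MHZ.

    direction: +1 to scale up, -1 to scale down, 0 to hold.
    mode: "one" moves a single step; "peg" jumps to the extreme setting.
    """
    if direction == 0:
        return index
    top = len(SPEEDS_MHZ) - 1
    if mode == "peg":
        return top if direction > 0 else 0
    if mode == "one":
        # Clamp so that stepping past either end leaves the speed unchanged.
        return min(top, max(0, index + direction))
    raise ValueError(f"unknown mode: {mode}")


idx = SPEEDS_MHZ.index(132.7)
print(SPEEDS_MHZ[next_speed_index(idx, +1, "one")])
print(SPEEDS_MHZ[next_speed_index(idx, +1, "peg")])
```

The up and down directions can use different modes, which is how a combination such as "peg up, one down" would be expressed with this helper.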

Our conclusion from the results with our benchmarks
is that the weighted average has undesirable behavior.
The number of intervals not only represents the length
of interval to be considered; it also represents the lag before
the system responds, much like the simple averaging
example described above. Unlike that simple policy,
once AVG_N starts responding, it will do so quickly. For
example, consider a system using an AVG_5 mechanism
with an upper boundary of 70% utilization and "one" as
the algorithm used to increment or decrement the clock
speed. Starting from an idle state, the clock will not scale
to 206MHz for 120ms (12 quanta). Once it scales up,
the system will continue to do so (as the average utilization
will remain above 70%) unless the next quantum is
partially idle. This occurs because the previous history is
still considered with equal weight even when the system
is running at a new clock value.

Table 1: Scheduling Actions for the AVG_9 Policy

The boundary conditions used by Pering in [5] result in
a system that scales more rapidly down than up. Table 1
illustrates how this occurs. If the weighted average is
70%, a fully active quantum will only increase the average
to 73% while a fully idle quantum will reduce it to
63% - thus, there is a tendency to reduce the processor
speed.
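
The 73% and 63% figures can be checked with a short Python sketch. It assumes the weighted utilization is updated as (N * W + U) / (N + 1), which reproduces both figures for N = 9; the function names are hypothetical.

```python
def avg_n_update(w_prev, u_prev, n):
    """One step of the weighted utilization: (N * W + U) / (N + 1)."""
    return (n * w_prev + u_prev) / (n + 1)


def quanta_until_above(threshold, n, w_start=0.0):
    """Count fully busy quanta needed before W first exceeds threshold."""
    if not 0.0 <= threshold < 1.0:
        raise ValueError("threshold must be in [0, 1)")
    w, count = w_start, 0
    while w <= threshold:
        w = avg_n_update(w, 1.0, n)
        count += 1
    return count


up = avg_n_update(0.70, 1.0, 9)    # fully active quantum
down = avg_n_update(0.70, 0.0, 9)  # fully idle quantum
print(f"after a busy quantum: {up:.2f}, after an idle quantum: {down:.2f}")
print("busy quanta to exceed 70% from idle:", quanta_until_above(0.70, 9))
```

With N = 9 a busy quantum moves the average up by 3 points and an idle quantum moves it down by 7, which is the asymmetry the paragraph describes. Under the same update rule, twelve consecutive busy quanta are needed before an average that starts at zero first exceeds 70%.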

The job of the scheduler is made even more difficult by
applications that attempt to make their own scheduling
decisions. For example, the default MPEG player in
the Itsy software distribution uses a heuristic to decide
whether it should sleep before computing the next frame.
If the rendering of a frame completes and the time until
that frame is needed is less than 12ms, the player enters
a spin loop; if it is greater than 12ms, the player
relinquishes the processor by sleeping. Therefore, if the 
player is well ahead of schedule, it will show significant 
idle times; once the clock is scaled close to the optimal 
value to complete the necessary work, the work seem- 
ingly increases. The kernel has no method of determin- 
ing that this is wasteful work. 
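
A small model shows why the work seems to increase as the clock approaches the optimal value. It is a sketch under assumptions, not the player's code: the per-frame work (FRAME_WORK_MCYCLES) is a made-up figure, every frame is assumed to cost the same, and the player is assumed to sleep for the whole slack whenever it sleeps at all.

```python
FRAME_PERIOD_MS = 1000.0 / 15.0  # 15 frames/sec, as in the MPEG example
SPIN_THRESHOLD_MS = 12.0         # the player spins when slack is below this
FRAME_WORK_MCYCLES = 8.0         # hypothetical work per frame, millions of cycles


def apparent_utilization(clock_mhz, work_mcycles=FRAME_WORK_MCYCLES):
    """Fraction of a frame period that the kernel sees as busy."""
    # Mcycles / MHz gives seconds; multiply by 1000 for milliseconds.
    render_ms = work_mcycles * 1000.0 / clock_mhz
    slack_ms = FRAME_PERIOD_MS - render_ms
    if slack_ms <= 0:
        return 1.0  # the player cannot keep up, so it is always busy
    if slack_ms < SPIN_THRESHOLD_MS:
        return 1.0  # the spin loop burns the slack, so the CPU never looks idle
    return render_ms / FRAME_PERIOD_MS  # the player sleeps through the slack


for mhz in (206.4, 162.2, 147.5, 132.7):
    print(f"{mhz:6.1f} MHz -> {apparent_utilization(mhz):.0%} busy")
```

In this model the utilization rises smoothly as the clock slows, then jumps to 100% once the slack falls under 12ms. An interval scheduler that sees only utilization would read that jump as a reason to raise the clock again.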

Furthermore, there is some mathematical justification
for our assertion that AVG_N fundamentally exhibits undesirable
behavior, and will not stabilize on an optimal
clock speed, even for simple and predictable workloads.
Our analysis only examines the "smoothing" portion of
AVG_N, not the clock setting policy. Nevertheless, it
works well enough to highlight the instability issues with
AVG_N by showing that, even if the system is started out
at the ideal clock speed, AVG_N smoothing will still result
in undesirable oscillation.




A processor workload over time may be treated as a
mathematical function, taking on a value of 1 when the
processor is busy, and 0 when idling. Borrowing techniques
from signal processing allows us to characterize
the effect of AVG_N on workloads in general as well as
specific instances. AVG_N filters its input using a decaying
exponential weighting function. For our implementation,
we used a recursive definition in terms of both
the previous actual ($U_{t-1}$) and weighted ($W_{t-1}$) utilizations:
$W_t = \frac{N W_{t-1} + U_{t-1}}{N+1}$. For the analysis, however,
it is useful to transform this into a less computationally
practical representation, purely in terms of earlier
unweighted utilizations. By recursively expanding the
$W_{t-1}$ term and performing a bit of algebra, this representation
emerges: $W_t = \frac{1}{N+1} \sum_{k=0}^{t-1} \left(\frac{N}{N+1}\right)^{t-k-1} U_k$.
This equation explicitly shows the dependency of each
$W_t$ on all previous $U_t$, and makes it more evident that
the weighted output may also be expressed as the result
of discretely convolving a decaying exponential function
with the raw input. This allows us to examine specific
types of workloads by artificially generating a representative
workload and then numerically convolving
the weighting function with it. We can also get a qualitative
feel for the general effects AVG_N has by moving to
continuous space and looking at the Fourier transform of
a decaying exponential, since convolving two functions
in the time domain is equivalent to multiplying their corresponding
Fourier transforms.
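
The equivalence between the recursive form and the convolution form can be checked numerically. The sketch below assumes the update W_t = (N * W_{t-1} + U_{t-1}) / (N + 1) with an initial weighted utilization of zero; the workload is a made-up busy/idle pattern and the function names are hypothetical.

```python
def weighted_recursive(utils, n):
    """W_t from the recursion, starting from W_0 = 0. out[t-1] holds W_t."""
    w, out = 0.0, []
    for u in utils:
        w = (n * w + u) / (n + 1)
        out.append(w)
    return out


def weighted_convolution(utils, n):
    """W_t as an explicit weighted sum over U_0 .. U_{t-1}."""
    a = n / (n + 1)  # per-quantum decay ratio of the weighting function
    out = []
    for t in range(1, len(utils) + 1):
        total = sum(a ** (t - k - 1) * utils[k] for k in range(t))
        out.append(total / (n + 1))
    return out


workload = [1, 1, 0, 1, 0, 0, 1, 1, 1, 0]  # made-up busy/idle pattern
rec = weighted_recursive(workload, n=3)
conv = weighted_convolution(workload, n=3)
print(max(abs(r - c) for r, c in zip(rec, conv)))
```

The recursive form needs constant work per quantum, while the explicit sum grows with the length of the history, which is why the latter is useful for analysis rather than for an implementation.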

Let's begin by examining the Fourier transform of a decaying
exponential: $x(t) = e^{-at}u(t)$, where $u(t)$ is the
unit step function, 0 for all $t < 0$ and 1 for $t > 0$.
This captures the general shape of the AVG_N weighting
function, shown in Figure 6. Its Fourier transform is
$X(\omega) = \frac{1}{j\omega + a}$. The transform attenuates, but does not
eliminate, higher frequency elements. If the input signal
oscillates, the output will oscillate as well. As $a$ gets
smaller the higher frequencies are attenuated to a greater
degree, but this corresponds to picking a larger value for
N in AVG_N and comes at the expense of greater lag in
response to changing processor load.
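
To make the attenuation claim concrete, the magnitude of the transform of a decaying exponential with rate $a > 0$ can be written out. The link between $a$ and N in the second line is an assumption introduced here for illustration: it matches the per-quantum decay ratio N/(N+1) of the recursive definition to $e^{-aT}$ for a quantum of length T, and is not part of the analysis above.

```latex
\[
  |X(\omega)| = \frac{1}{\sqrt{a^{2} + \omega^{2}}}, \qquad
  \frac{|X(\omega)|}{|X(0)|} = \frac{a}{\sqrt{a^{2} + \omega^{2}}} > 0
  \quad \text{for all finite } \omega .
\]
\[
  e^{-aT} = \frac{N}{N+1}
  \;\Longrightarrow\;
  a = \frac{1}{T}\,\ln\!\left(1 + \frac{1}{N}\right).
\]
```

The ratio in the first line is strictly positive, so no frequency is removed entirely, and it shrinks at any fixed $\omega$ as $a$ decreases. Under the assumed mapping, $a$ decreases as N grows, which matches the trade-off between smoothing and lag.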

For a specific workload example, we'll use a simple re- 
peating rectangle wave, busy for 9 cycles, and then idle 
for 1 cycle. This is an idealized version of our MPEG 
player running roughly at an optimal speed, i.e. just idle
enough to indicate that the system isn't saturated. Ideally,
a policy should be stable when it has the system running
at an optimal speed. This implies that the weighted
utilization should remain in a range that would prevent 
the processor speed from changing. However, as was 
fore-shadowed by our initial qualitative discussion, this 
is not the case. A rectangular wave has many high fre- 
quency components, and these result in a processor uti- 
lization as shown in Figure 7. This figure shows the os- 
cillation for this example, and shows that oscillation oc- 
curs over a surprisingly wide range of the processor utilization.
As discussed earlier, our experimental results
with the MPEG player on the Itsy also exhibit this os- 
cillation because that application exhibits the same step- 
function resource demands exhibited by our example. 
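
The rectangle-wave example can be reproduced numerically. The sketch below is an illustration rather than the simulation behind Figure 7: it assumes the recursive update (N * W + U) / (N + 1), runs the 9-busy, 1-idle pattern long enough for the start-up transient to die out, and reports the range of the weighted utilization over the final period. The function names are hypothetical.

```python
def avg_n_trace(workload, n, w0=0.0):
    """Weighted utilization after each quantum of the workload."""
    w, trace = w0, []
    for u in workload:
        w = (n * w + u) / (n + 1)
        trace.append(w)
    return trace


def steady_state_range(n, busy=9, idle=1, periods=200):
    """Min and max of W over the final period of a busy/idle rectangle wave."""
    period = [1.0] * busy + [0.0] * idle
    trace = avg_n_trace(period * periods, n)
    last = trace[-len(period):]
    return min(last), max(last)


for n in (1, 3, 5, 9):
    lo, hi = steady_state_range(n)
    print(f"AVG_{n}: weighted utilization swings between {lo:.3f} and {hi:.3f}")
```

Even though the true utilization is a constant 90% per period, the weighted value never settles. For N = 9 the swing in this sketch is roughly 0.85 to 0.94, and it widens as N shrinks, so any pair of thresholds placed inside that range would be crossed on every period.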

We also simulated interval-based averaging policies that 
used a pure average rather than an exponentially de- 
caying weighting function, but our simulations indi- 
cated that that policy would perform no better than the 
weighted averaging policy. Simple averaging suffers 
from the same problems experienced by the weighted 
averaging if you do not average the appropriate period. 

5.4 Summary of Results

We are omitting a detailed exposition on the scheduling
behavior of each scheduling policy primarily because
most of them resulted in equivalent (and poor) behavior.
Recall that the best possible scheduling goal for MPEG
would be to switch to a 132MHz speed and continue to
render all the frames at that speed. No heuristic policy
that we examined achieved this goal. Figure 8 shows
the clock setting behavior of the best policy we found.
That policy uses the PAST heuristic (i.e. AVG_0) and
"pegs" the CPU speed either to 206MHz or 59MHz depending
on the weight metric. The bounds on the hysteresis
were that a CPU utilization greater than 98%
would cause the CPU to increase the clock speed and a
CPU utilization less than 93% would decrease the clock
speed.
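
The decision rule of this policy is small enough to sketch directly. The code is illustrative only: the function and constant names are hypothetical, and the utilization values fed to it are made up.

```python
HIGH_MHZ, LOW_MHZ = 206.4, 59.0
SCALE_UP_ABOVE, SCALE_DOWN_BELOW = 0.98, 0.93


def past_peg(current_mhz, last_utilization):
    """PAST (AVG_0): decide from the previous quantum's utilization alone."""
    if last_utilization > SCALE_UP_ABOVE:
        return HIGH_MHZ
    if last_utilization < SCALE_DOWN_BELOW:
        return LOW_MHZ
    return current_mhz  # inside the hysteresis band: keep the current speed


speed = HIGH_MHZ
for util in (1.00, 0.95, 0.40, 0.95, 1.00):  # made-up per-quantum utilizations
    speed = past_peg(speed, util)
    print(f"utilization {util:.0%} -> {speed} MHz")
```

Because the upper bound is so close to 100%, only a nearly saturated quantum raises the clock, and because the policy pegs, a single such quantum sends it straight to the top setting. That combination protects deadlines at the cost of frequent switching between the two extremes.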



The PAST policy we described results in a small but sta- 
tistically significant reduction in energy for the MPEG 
application. Allowing the processor to scale the voltage
when the clock speed drops below 162.2MHz results in
no statistically significant decrease.

We initially surmised that there is no improvement because
the cost of voltage and clock scaling on our platform
outweighs any gains. We measured the cost of
clock and voltage scaling using the DAQ. To measure
clock scaling, we coded a tight loop that switched the 
processor clock as quickly as possible. 

Before each clock change, we inverted the state of a spe- 
cific GPIO and used the DAQ to measure the interval 
with high precision. We took measurements when the 
clock changed across many different clock settings (e.g.
from 59 to 206MHz, from 191 to 206MHz and so on). 



This policy is "best" because it never misses any dead- 
line (across all the applications) and it also saves a small 
but significant amount of energy. This last point is il- 
lustrated in Table 2. This table shows the 95% confi- 
dence interval for the average energy needed to run the 
MPEG application. The reduction in energy between 
206MHz and 132MHz occurs because the application 
wastes fewer cycles in the application idle loop used to
meet the frame delays for the MPEG clip. An approximately 8%
energy reduction occurs when we drop the processor voltage
to 1.23V - this is less than the 15% maximum reduction
we measured because the application uses resources
(e.g. audio) that are not affected by voltage scaling.
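
The percentages quoted here can be recomputed from the confidence intervals in Table 2. The sketch below uses the midpoint of each interval as the estimate of the mean, which is an assumption (it presumes symmetric intervals); the dictionary keys and function names are hypothetical, and no energy unit is assumed because the table does not state one.

```python
# 95% confidence intervals for MPEG energy, copied from Table 2.
TABLE2 = {
    "206.4MHz, 1.5V": (85.59, 86.49),
    "132.7MHz, 1.5V": (79.59, 80.94),
    "132.7MHz, 1.23V": (73.76, 74.41),
    "PAST peg, 1.5V": (85.03, 85.47),
    "PAST peg, voltage scaling": (84.60, 85.45),
}


def midpoint(interval):
    lo, hi = interval
    return (lo + hi) / 2.0


def reduction(baseline, other):
    """Fractional energy reduction of `other` relative to `baseline`."""
    b, o = midpoint(TABLE2[baseline]), midpoint(TABLE2[other])
    return (b - o) / b


def overlaps(a, b):
    """True when the two confidence intervals share any value."""
    (alo, ahi), (blo, bhi) = TABLE2[a], TABLE2[b]
    return alo <= bhi and blo <= ahi


print(f"{reduction('132.7MHz, 1.5V', '132.7MHz, 1.23V'):.1%}")
print(f"{reduction('206.4MHz, 1.5V', 'PAST peg, 1.5V'):.1%}")
print(overlaps("PAST peg, 1.5V", "PAST peg, voltage scaling"))
```

On these midpoints the voltage drop at 132.7MHz saves about 7.7%, and the PAST policy saves just under 1% relative to a constant 206.4MHz, with intervals that do not overlap. The two PAST rows do overlap, which is consistent with voltage scaling adding no measurable benefit for that policy.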

Figure 8: Clock frequency for the MPEG application using
the best scheduling policy from our empirical study
- the scheduling policy only selects 59MHz or 206MHz
clock settings and changes clock settings frequently.
This scheduling policy results in suboptimal energy savings
but avoids noticeable application slowdown.




Clock scaling took approximately 200 microseconds, independent
of the starting or target speed. During that
time, the processor cannot execute instructions. Thus,
the cost of a frequency change varies between 11,200 clock
periods at 59MHz and 40,000 clock periods at 200MHz.

We measured the time for the voltage to settle following
a voltage change. It takes approximately 250 microseconds to
reduce voltage from 1.5V to 1.23V; in fact, the voltage
slowly reduces, drops below 1.23V and then rapidly
settles on 1.23V. Voltage increases were effectively instantaneous.
We suspect the slow decay occurs because
of capacitance; many processors use external decoupling
capacitors to provide sufficient current sourcing for processors
that have widely varying current demands.

These measurements indicate that the time needed for
clock and voltage changes is less than 2% of the
scheduling interval; thus, we would be able to change 
the clock or voltage on every scheduling decision with 
less than 2% overhead. The fact that we see little energy 
reduction is related to the limited energy savings possi- 
ble with the voltage scaling available on this platform 
and the efficacy of the policies we explored. 



6 Conclusions and Future Work 



Our implementation results were disappointing to us -
we had hoped to be able to identify a prediction heuristic
that resulted in significant energy savings, and we
thought that the claims made by previous studies would
be borne out by experimentation. Although we have
found a policy that saves some energy, that policy leaves
much to be desired. The policy causes many voltage and
clock changes, which may incur unnecessary overhead;
this will be less of a problem as processors are better
designed to accommodate those changes. However, the
policy did result in both the most responsive system behavior
and most significant energy reduction of all the
policies we examined.

Algorithm                                            Energy
Constant Speed @ 206.4 MHz, 1.5 Volts                85.59 - 86.49
Constant Speed @ 132.7 MHz, 1.5 Volts                79.59 - 80.94
Constant Speed @ 132.7 MHz, 1.23 Volts               73.76 - 74.41
PAST, Peg - Peg, Thresholds: > 98% scales up,        85.03 - 85.47
  < 93% scales down, 1.5 Volts
PAST, Peg - Peg, Thresholds: > 98% scales up,        84.60 - 85.45
  < 93% scales down, Voltage Scaling @ 162.2 MHz

Table 2: Summary of Performance of Best Clock Scaling Algorithms

Figure 9: Non-linear change in Utilization with Clock
Frequency (in MHz)

Table 3: Memory access time in cycles for reading individual
words as well as full cache lines.

As with all empirical studies, there are anomalies in our
system that we cannot explain and that may have influenced
our results. We found that the processor utilization
does not always vary linearly with clock frequency.
Figure 9 shows the processor utilization vs. clock frequency
for the MPEG benchmark. There is a distinct
"plateau" between 162MHz and 176.9MHz. We believe
that this delay may be induced by the varying number
of clock cycles needed for memory accesses as the processor
frequency changes, as shown in Table 3. That
table shows the memory access time for EDO DRAM
for reading individual words or a full cache line; there
is an obvious non-linear increase between 162MHz and
176.9MHz. The potential speed mismatch between processor
and memory has been noted by others [12], but
we have not devised a way to verify that this is the only
factor causing the non-linear behavior we noted.

This paper is the first step in an effort to provide robust
support for voltage and clock scheduling within the
Linux operating system. Although our initial results are
disappointing, we feel that they serve to stop us from attempting
to devise clever heuristics that could be used
for clock scheduling. It may well be that Pering [11]
reached a similar conclusion since their later publications
discontinued the use of heuristics, but their publications
don't describe the implementation of their operating
system design or the rationale behind the policies
used. Furthermore, they don't describe how deadlines
are to be "synthesized" for applications such as Web
and TalkingEditor where there is no clear "deadline".



Our immediate future work is to provide "deadline" 
mechanisms in Linux. These deadlines are not precisely 
the same mechanism needed in a true real-time O/S -
in an RTOS, the application does not care if the deadline
is reached early, while energy scheduling would prefer 
for the deadline to be met as late as possible. A further 
challenge we face will be to find a way to automatically 
synthesize those deadlines for complex applications. 